Overfitting in Synthesis: Theory and Practice (Extended Version)
In syntax-guided synthesis (SyGuS), a synthesizer's goal is to automatically
generate a program belonging to a grammar of possible implementations that
meets a logical specification. We investigate a common limitation across
state-of-the-art SyGuS tools that perform counterexample-guided inductive
synthesis (CEGIS). We empirically observe that as the expressiveness of the
provided grammar increases, the performance of these tools degrades
significantly.
We claim that this degradation is not only due to a larger search space, but
also due to overfitting. We formally define this phenomenon and prove
no-free-lunch theorems for SyGuS, which reveal a fundamental tradeoff between
synthesizer performance and grammar expressiveness.
A standard approach to mitigate overfitting in machine learning is to run
multiple learners with varying expressiveness in parallel. We demonstrate that
this insight can immediately benefit existing SyGuS tools. We also propose a
novel single-threaded technique called hybrid enumeration that interleaves
different grammars and outperforms the winner of the 2018 SyGuS competition
(Inv track), solving more problems and achieving a mean speedup.
Comment: 24 pages (5 pages of appendices), 7 figures, includes proofs of theorems
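To make the idea concrete, here is a minimal sketch of a CEGIS-style loop that interleaves enumeration over several grammars of increasing expressiveness instead of exhausting one grammar before trying the next. The grammar levels, the toy specification, and all helper names below are illustrative assumptions, not the paper's implementation.

    # Toy CEGIS with hybrid enumeration: round-robin over grammars of
    # increasing expressiveness, accumulating counterexamples.
    # All names here are illustrative, not from the paper's artifact.
    import itertools

    # Grammar levels: each is a list of candidate programs (functions of x),
    # ordered roughly by expressiveness.
    GRAMMARS = [
        [lambda x: 0, lambda x: 1, lambda x: x],                 # level 0
        [lambda x: x + 1, lambda x: x - 1, lambda x: 2 * x],     # level 1
        [lambda x: x * x, lambda x: x * x + 1, lambda x: -x],    # level 2
    ]

    def spec(f, x):
        """Toy specification: f(x) must equal x*x + 1."""
        return f(x) == x * x + 1

    def verify(f, domain=range(-50, 51)):
        """Return a counterexample input, or None if f meets the spec."""
        for x in domain:
            if not spec(f, x):
                return x
        return None

    def hybrid_enumerate():
        counterexamples = []
        # Interleave candidates from all grammar levels (round-robin),
        # instead of exhausting one grammar before trying the next.
        for cands in itertools.zip_longest(*GRAMMARS):
            for f in cands:
                if f is None:
                    continue
                # Fast rejection against previously seen counterexamples.
                if all(spec(f, x) for x in counterexamples):
                    cex = verify(f)
                    if cex is None:
                        return f
                    counterexamples.append(cex)
        return None

    solution = hybrid_enumerate()
    print(solution, solution(3) if solution else None)  # expect 10

Running learners over restricted grammars first plays the same role as the parallel multi-learner setup: less expressive grammars are less prone to overfit the counterexamples seen so far.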
FlashProfile: A Framework for Synthesizing Data Profiles
We address the problem of learning a syntactic profile for a collection of
strings, i.e., a set of regex-like patterns that succinctly describe the
syntactic variations in the strings. Real-world datasets, typically curated
from multiple sources, often contain data in various syntactic formats. Thus,
any data processing task is preceded by the critical step of data format
identification. However, manual inspection of data to identify the different
formats is infeasible in standard big-data scenarios.
Prior techniques are restricted to a small set of pre-defined patterns (e.g.,
digits, letters, words), and provide no control over the granularity of
profiles. We define syntactic profiling as the problem of clustering strings
based on syntactic similarity, followed by identifying patterns that succinctly
describe each cluster. We present a technique for synthesizing such profiles
over a given language of patterns, which also allows for interactive refinement
by requesting a desired number of clusters.
Using a state-of-the-art inductive synthesis framework, PROSE, we have
implemented our technique as FlashProfile. Across tasks over large
real-world datasets, we observe a median profiling time of only ~0.7 s.
Furthermore, we show that access to syntactic profiles may allow for more
accurate synthesis of programs, i.e., using fewer examples, in
programming-by-example (PBE) workflows such as FlashFill.
Comment: 28 pages, SPLASH (OOPSLA) 2018
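To illustrate the clustering-then-patterns view of profiling, the sketch below groups strings by a coarse token-class signature and emits one regex-like pattern per cluster. The token classes and all names are simplifying assumptions; FlashProfile itself synthesizes patterns over a richer, user-extensible language via PROSE.

    # Toy syntactic profiler: cluster strings by a token-class signature,
    # then emit one regex-like pattern per cluster. Purely illustrative.
    import re
    from collections import defaultdict

    # Coarse token classes (an assumption; FlashProfile's atoms are richer).
    TOKENS = [
        (re.compile(r"\d+"), r"\d+"),
        (re.compile(r"[A-Za-z]+"), r"[A-Za-z]+"),
        (re.compile(r"\s+"), r"\s+"),
    ]

    def signature(s):
        """Rewrite a string into a sequence of token-class patterns."""
        sig, i = [], 0
        while i < len(s):
            for rx, pat in TOKENS:
                m = rx.match(s, i)
                if m:
                    sig.append(pat)
                    i = m.end()
                    break
            else:                      # literal character (e.g., '-', '/')
                sig.append(re.escape(s[i]))
                i += 1
        return tuple(sig)

    def profile(strings):
        """Cluster by signature; return one pattern per cluster + support."""
        clusters = defaultdict(list)
        for s in strings:
            clusters[signature(s)].append(s)
        return [("".join(sig), len(ms)) for sig, ms in clusters.items()]

    data = ["2024-01-05", "1999-12-31", "N/A", "n/a", "07 Jan 2024"]
    for pattern, count in profile(data):
        print(f"{count:2d}  {pattern}")
    # "2024-01-05" and "1999-12-31" share the pattern \d+\-\d+\-\d+

Requesting a different number of clusters then corresponds to merging or splitting these signature groups at coarser or finer token granularities.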
Learning Nonlinear Loop Invariants with Gated Continuous Logic Networks (Extended Version)
Verifying real-world programs often requires inferring loop invariants with
nonlinear constraints. This is especially true in programs that perform many
numerical operations, such as control systems for avionics or industrial
plants. Recently, data-driven methods for loop invariant inference have shown
promise, especially on linear invariants. However, applying data-driven
inference to nonlinear loop invariants is challenging due to the large number
and magnitude of high-order terms, the potential for overfitting on a small
number of samples, and the large space of possible inequality bounds.
In this paper, we introduce a new neural architecture for general SMT
learning, the Gated Continuous Logic Network (G-CLN), and apply it to nonlinear
loop invariant learning. G-CLNs extend the Continuous Logic Network (CLN)
architecture with gating units and dropout, which allow the model to robustly
learn general invariants over large numbers of terms. To address overfitting
that arises from finite program sampling, we introduce fractional sampling---a
sound relaxation of loop semantics to continuous functions that facilitates
unbounded sampling on the real domain. We additionally design a new CLN activation
function, the Piecewise Biased Quadratic Unit (PBQU), for naturally learning
tight inequality bounds.
We incorporate these methods into a nonlinear loop invariant inference system
that can learn general nonlinear loop invariants. We evaluate our system on a
benchmark of nonlinear loop invariants and show it solves 26 out of 27
problems, 3 more than prior work, with an average runtime of 53.3 seconds. We
further demonstrate the generic learning ability of G-CLNs by solving all 124
problems in the linear Code2Inv benchmark. We also perform a quantitative
stability evaluation and show G-CLNs have a convergence rate of 97.5% on
quadratic problems, a 39.2% improvement over CLN models.
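The gating idea admits a small self-contained illustration. The sketch below applies a gated product t-norm to continuous truth values of equality atoms, in the spirit of G-CLNs; the smoothing function, the gate values, and the sample data are assumptions for illustration, not the paper's architecture.

    # Minimal sketch of gated continuous-logic semantics in the spirit of
    # G-CLNs (illustrative; not the authors' architecture). An atomic
    # constraint's continuous truth value lies in [0, 1]; gates in [0, 1]
    # let training smoothly drop irrelevant atoms from a conjunction.
    import numpy as np

    def eq_truth(x, sigma=0.5):
        """Continuous truth of 'x == 0': equals 1 exactly when x is 0."""
        return np.exp(-(x ** 2) / (2 * sigma ** 2))

    def gated_and(truths, gates):
        """Gated product t-norm: a gate of 0 makes its atom vacuously true."""
        return np.prod(gates * truths + (1.0 - gates))

    # Values of two candidate equality atoms at sampled program states
    # (assumed data): atom 1 holds on every state, atom 2 on none.
    atom_vals = np.array([
        [0.0, 5.0],    # state 1
        [0.0, 11.0],   # state 2
    ])

    gates = np.array([1.0, 0.0])   # training would learn to keep atom 1 only
    for row in atom_vals:
        print(gated_and(eq_truth(row), gates))   # stays near 1.0

Because the gated conjunction stays differentiable, gradient descent can push gates of spurious terms toward 0 rather than overfitting their coefficients to the finite sample.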
Invariant Synthesis for Incomplete Verification Engines
We propose a framework for synthesizing inductive invariants for incomplete
verification engines, which soundly reduce logical problems in undecidable
theories to decidable theories. Our framework is based on the
counterexample-guided inductive synthesis (CEGIS) principle and allows
verification engines to
communicate non-provability information to guide invariant synthesis. We show
precisely how the verification engine can compute such non-provability
information and how to build effective learning algorithms when invariants are
expressed as Boolean combinations of a fixed set of predicates. Moreover, we
evaluate our framework in two verification settings, one in which verification
engines need to handle quantified formulas and one in which verification
engines have to reason about heap properties expressed in an expressive but
undecidable separation logic. Our experiments show that our invariant synthesis
framework based on non-provability information can both effectively synthesize
inductive invariants and adequately strengthen contracts across a large suite
of programs.
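As a rough sketch of this loop (not the paper's framework), the code below has a learner propose conjunctions over a fixed predicate set while an incomplete verifier answers with success, a counterexample, or non-provability feedback that blocks a conjunct from future proposals. Every predicate, state, and the feedback encoding are illustrative assumptions.

    # Toy CEGIS loop with non-provability feedback (illustrative; names,
    # predicates, and the feedback encoding are assumptions, not an API).
    from itertools import combinations

    # Fixed predicate set over a state (x, y); invariants are conjunctions.
    PREDICATES = {
        "x>=0": lambda s: s[0] >= 0,
        "y>=0": lambda s: s[1] >= 0,
        "x<=y": lambda s: s[0] <= s[1],
        "x!=7": lambda s: s[0] != 7,   # true on samples, "unprovable" below
    }

    def verifier(conjuncts, reachable):
        """Incomplete engine: flags a conjunct it cannot reason about
        ('unknown'), returns a counterexample if a reachable state violates
        the candidate, and 'ok' otherwise."""
        if "x!=7" in conjuncts:
            return ("unknown", "x!=7")       # non-provability information
        for s in reachable:
            if not all(PREDICATES[p](s) for p in conjuncts):
                return ("cex", s)
        return ("ok", None)

    def cegis(samples, reachable):
        blocked, cexs = set(), []
        names = list(PREDICATES)
        for size in range(len(names), 0, -1):   # prefer strong invariants
            for cand in combinations(names, size):
                if blocked & set(cand):
                    continue
                states = samples + cexs
                if not all(all(PREDICATES[p](s) for p in cand)
                           for s in states):
                    continue                    # inconsistent with data
                verdict, info = verifier(cand, reachable)
                if verdict == "ok":
                    return cand
                if verdict == "cex":
                    cexs.append(info)
                else:
                    blocked.add(info)           # never propose it again
        return None

    samples = [(0, 0), (1, 2)]
    reachable = [(0, 0), (1, 2), (3, 5)]
    print(cegis(samples, reachable))   # e.g. ('x>=0', 'y>=0', 'x<=y')

The key difference from plain CEGIS is the third verdict: instead of looping forever on a conjunct the engine cannot decide, the learner prunes it and searches the remaining Boolean combinations.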
Data-Driven Learning of Invariants and Specifications
Although the program verification community has developed several techniques for analyzing software and formally proving its correctness, these techniques are too sophisticated for end users and require significant investment in terms of time and effort. In this dissertation, I present techniques that help programmers easily formalize the initial requirements for verifying their programs: specifications and inductive invariants. The proposed techniques leverage ideas from program synthesis and statistical learning to automatically generate these formal requirements from readily available program-related data, such as test cases and execution traces. I detail three of these data-driven learning techniques: FlashProfile and PIE for specification learning, and LoopInvGen for invariant learning.

I conclude with some principles for building robust synthesis engines, which I learned while refining the aforementioned techniques. Since program synthesis is a form of function learning, it is perhaps unsurprising that some of the fundamental issues in program synthesis have also been explored in the machine learning community. I study one particular phenomenon: overfitting. I present a formalization of overfitting in program synthesis, and discuss two mitigation strategies inspired by existing techniques.